January 23, 2025
Problem Set #1 will be posted by end of day today!
Three problems:
Showing some properties of L2 regularization when P > N
Implementing cyclic coordinate descent as a method for finding the optimal solution for the LASSO
Building classifiers for a disease prediction data set
You may work in teams of up to three students
Each student must turn in a copy of the solutions on Canvas by the deadline
Note your teammates at the top of your solutions
Two files (minimum) for your submission
.ipynb or similar file
Rendered .html or .pdf of your solutions
Please don’t make me start a .ipynb server for your submission!
For proof style questions, you can probably get a lot of the solutions via ChatGPT
Why take this class if you’re just going to let ChatGPT solve the problems?
I think the results for the pen-and-paper parts are interesting and will force you to think differently about results related to regularization that you took as a given in your previous ML classes
Also, they’re just kinda cool!
For coding exercises, please use ChatGPT/Copilot as much as your little heart desires
It’s just the way the world works now!
sklearn is annoying in its setup
matplotlib is even worse
In the last couple of classes, we discussed the generic supervised predictive problem
Let’s assume that we have some outcome, y, we would like to predict and that each outcome is generated as a function of some features, \mathbf x.
Specifically: y = f(\mathbf x) + \epsilon
where f(\mathbf x) is some arbitrarily complex function.
We want to use the N instances of training data that we receive to learn f(\mathbf x) to the best of our ability!
In other words, we want a generalizable approximation to the true data generating function!
Denote our approximating function as \hat{f}(\mathbf x)
The risk of an approximation is computed w.r.t. a loss function - a quantitative measure of how good we expect our predictions to be
Generally, we want to select \hat{f}(\mathbf x) that minimizes the actual expected risk with respect to the data generating distribution:
\hat{f}(\mathbf x) = \underset{f^*(\mathbf x)}{\text{argmin }} E_\mathcal T[L(y , f^*(\mathbf x))]
where f^*(\mathbf x) is a candidate function of the feature vector.
We have knowledge of:
What influences the true expected risk (the size of the training data, the complexity of the approximating function, etc.)
How to closely approximate the unknowable true expected risk (train/validation/test splits, K-fold cross validation, etc.) for any candidate function
The goal, then, is to choose the correct function out of a set of candidate functions!
Searching over all possible functions of \mathbf x \in \mathbb R^P is generally not possible
Even after restricting the search, we're almost always working with an infinite hypothesis space
The total number of points at which we would want to make a prediction is infinite
Just think about one continuous feature!
Formally, let \mathcal H be the hypothesis space of our search for \hat{f}(\mathbf x)
\mathcal H includes all of the possible functions that we’ll examine
Choose f^*(\mathbf x) \in \mathcal H that minimizes the generalization error
Typically \mathcal H is restricted to one class of model
The most common \mathcal H is the set of linear models
Linear model
Given a set of P features, a linear model is one where:
\hat{y}_0 = g(\phi(x_0)^T \hat{\boldsymbol \beta})
Thinking in terms of predictions from a training set, a linear model is one where:
\hat{y}_0 = g(\mathbf S_0 \mathbf y)
where \mathbf y is the collection of N training outcomes and \mathbf S_0 is a 1 \times N weighting vector that is a function of \mathbf x_0 and \mathbf X.
For a standard linear model with no feature transformations fit via OLS:
\hat{y}_0 = \mathbf x_0^T \hat{\boldsymbol \beta}
\hat{y}_0 = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y
\mathbf S_0 = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T
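As a quick check (a minimal numpy sketch on synthetic data; variable names are illustrative), we can verify that the OLS prediction at a new point really is a weighted combination of the training outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))                   # training features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)
x0 = rng.normal(size=P)                       # a query point

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS coefficients
S0 = x0 @ np.linalg.inv(X.T @ X) @ X.T        # 1 x N weighting vector

# The prediction is a linear combination of the training outcomes
assert np.allclose(S0 @ y, x0 @ beta_hat)
```

The same construction works for ridge regression by adding \lambda \mathbf I inside the inverse.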
We need to be really clear that linear models do not only include linear prediction surfaces!
Feature extraction
Let \mathbf x be a vector of features. Then, define the feature extractor as a function of the features:
\phi(\mathbf x)
With clever choice of feature extractor, the linear model can be used to induce really complex functions!
For a single feature model, we can define a feature extractor as the polynomial expansion of degree d
\phi(\mathbf x) = [1,x,x^2,...,x^d]
which induces the standard polynomial regression model.
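A minimal numpy sketch of this feature extractor (the function name phi is my own):

```python
import numpy as np

def phi(x, d):
    """Polynomial feature extractor: [1, x, x^2, ..., x^d]."""
    return x ** np.arange(d + 1)

print(phi(np.array(2.0), 3))  # [1. 2. 4. 8.]
```

Regressing y on these columns is ordinary polynomial regression.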
For multifeature models, polynomial expansions tend to grow in size exponentially with the degree of the polynomial
Need to consider both expansions for each term and interactions between different features
Fully saturated models have roughly {P + d} \choose {d} terms!
Get around this with clever kernel functions like the polynomial kernel or RBF kernel
With good choices for kernel functions, we can represent a large and complicated feature extractor with a N \times N matrix
Polynomial Kernel:
K(\mathbf x , \mathbf z) = (\mathbf x^T \mathbf z + 1)^d
corresponds to a feature expansion for each instance as:
[1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2]
for a second degree kernel with two features.
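To see the correspondence concretely, here is a numpy sketch (the explicit map phi2 is written out by hand for two features) verifying that the degree-2 kernel equals an inner product of expanded features:

```python
import numpy as np

def phi2(v):
    """Explicit feature map matching the degree-2 polynomial kernel (2 features)."""
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# Kernel trick: evaluating the kernel avoids building the expansion
assert np.isclose((x @ z + 1) ** 2, phi2(x) @ phi2(z))
```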
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
import numpy as np
# Train logistic regression on a degree-15 polynomial feature expansion
model = make_pipeline(PolynomialFeatures(degree=15), LogisticRegression())
model.fit(X_train, y_train)
# Create a grid to evaluate model
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
# Predict on the grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, alpha=0.8)
#plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('15 degree polynomial kernel w/ Logistic Regression')
plt.show()
RBF Kernel:
K(\mathbf x , \mathbf z) = \exp \left[ - \frac{\| \mathbf x - \mathbf z \|^2}{2\sigma^2} \right]
corresponds to an infinite-dimensional feature expansion for each instance that includes all polynomial terms and interactions.
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
import numpy as np
# Train logistic regression on a random-feature approximation of the RBF kernel
model = make_pipeline(RBFSampler(gamma=1, random_state=42), LogisticRegression())
model.fit(X_train, y_train)
# Create a grid to evaluate model
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
# Predict on the grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, alpha=0.8)
#plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('RBF Kernel Logistic Regression (sig2 = 1)')
plt.show()
Note that the work of making the linear model wiggly relies on choosing an appropriate feature extractor for the problem
For now, we’ll assume that the feature extractor is fixed and there is nothing related to it to learn
Neural networks are all about learning a complicated feature extractor
More to come soon.
The linear model:
\hat{y} = g(\phi(\mathbf x)^T \boldsymbol \beta)
is a hypothesis space over all possible linear models that can be specified by different values of \boldsymbol \beta.
We want to choose f^*(\mathbf x) \in \mathcal H that performs well for unseen data.
For the linear model, this amounts to choosing \hat{\boldsymbol \beta} \in \mathbb R^P that minimizes the generalization error.
An approach:
Our expectation:
E_{\mathcal T}[L(y , g(\phi(\mathbf x)^T \boldsymbol \beta^*))]
is a long run average. Given N samples from \mathcal T, we can approximate this expectation with the empirical risk:
\frac{1}{N} \sum \limits_{i = 1}^N L(y_i , g(\phi(\mathbf x_i)^T \boldsymbol \beta^*)) \overset{p}{\longrightarrow}E_{\mathcal T}[L(y , g(\phi(\mathbf x)^T \boldsymbol \beta^*))]
Therefore, it suffices to select coefficients that minimize the empirical risk or in-sample error.
\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N L(y_i , g(\phi(\mathbf x_i)^T \boldsymbol \beta^*))
However, we know that the in-sample error tends to underestimate the generalization error when N is finite.
To deal with this gap, it is common to add a penalty or regularizer term that accounts for the difference between the in-sample error and the generalization error:
\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N L(y_i , g(\phi(\mathbf x_i)^T \boldsymbol \beta^*)) + \lambda C(\boldsymbol \beta^*)
where C(\cdot) is a measure of the model complexity and \lambda is a complexity knob that controls the penalty on complexity as a function of N (or just chosen using CV or other methods).
This definition is quite broad!
Defines a two-step training procedure:
Find the ERM for the model given any tuning parameters
Choose the tuning parameter to minimize the generalization error
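A sketch of this two-step procedure for ridge regression on synthetic data (pure numpy; the closed-form ridge solution stands in for a generic ERM solver, and the train/validation split is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=120)
Xtr, ytr, Xva, yva = X[:80], y[:80], X[80:], y[80:]

def ridge_erm(X, y, lam):
    """Step 1: the ERM for a fixed tuning parameter (closed form for ridge)."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

# Step 2: choose the tuning parameter that minimizes estimated generalization error
lams = np.logspace(-3, 3, 13)
val_mse = [np.mean((yva - Xva @ ridge_erm(Xtr, ytr, lam)) ** 2) for lam in lams]
best_lam = lams[int(np.argmin(val_mse))]
print(best_lam)
```

Step 1 is cheap here because ridge has a closed form; for other penalties, step 1 is itself an optimization problem.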
Commonly seen linear models in ML are broken into three categories:
Continuous outcome: \phi(\mathbf x)^T \boldsymbol \beta \in \mathbb R
Binary outcome: g(\phi(\mathbf x)^T \boldsymbol \beta) \in (0,1)
Unordered categorical outcomes: g(\phi(\mathbf x)^T \boldsymbol \beta_k) \in \{1,2,...,K\}
There are other types of outcomes, but these are the most commonly seen
In each case, the coefficients are found using ERM with respect to a loss function.
Given training data, \mathbf X and \mathbf y, find the coefficients that minimize the loss:
Linear Regression
For a continuous outcome, find \hat{\boldsymbol \beta} that minimizes the mean squared error loss:
\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N (y_i - \phi(\mathbf x_i)^T \boldsymbol \beta^*)^2
Binary Logistic Regression
For a binary outcome, y_i \in \{0,1\}, find \hat{\boldsymbol \beta} that minimizes the probability loss:
\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N 1 - I(y_i = 1)\sigma(\phi(\mathbf x_i)^T \boldsymbol \beta^*) - I(y_i = 0)(1 - \sigma(\phi(\mathbf x_i)^T \boldsymbol \beta^*))
where \sigma(z) is the binary logistic function:
\sigma(z) = \frac{\exp[z]}{1 + \exp[z]}
and equates to the probability that y_i = 1 given our model choice and \mathbf x.
Multinomial Logistic Regression
For a categorical outcome, y_i \in \{1,2,...,K\}, find \hat{\boldsymbol \beta}_k \ \forall k \in \{1,...,K\} that minimizes the probability loss:
\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N \sum \limits_{k = 1}^K I(y_i \neq k) \sigma_k(\phi(\mathbf x_i)^T \boldsymbol \beta_k)
where there are K P-vectors of coefficients and \sigma_k(\mathbf z) is the softmax function for category k:
\sigma_k(\mathbf z) = \frac{\exp[z_k]}{\sum \limits_{h = 1}^K \exp[z_h]}
and equates to the probability that y_i = k given our model choice and \mathbf x.
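A minimal numpy implementation of the softmax (subtracting max(z) before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Softmax over a score vector; shifting by max(z) leaves the ratios unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)  # a valid probability vector over the K categories
```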
Each of these are common linear models
Though it may appear that these are all very different methods, they all actually fall under the umbrella of the generalized linear model framework
We’ll discuss this briefly, but the major implication is that each of these models corresponds to a maximum likelihood estimate from a distribution in the exponential family
We’ll start with a familiar setting:
Suppose we have N i.i.d. draws of training data that are drawn from \mathcal T. \mathcal T is a giant joint distribution.
The discriminative part:
Instead of worrying about the giant joint distribution, let's just think about the conditional distribution of y given some feature values:
Pr(y | \mathbf x)
Rather than leaving this in its arbitrary form, perhaps we're willing to assume that Pr(y | \mathbf x) follows some known distributional form that is parameterized by a set of values, \theta.
Example: The Normal Distribution and Linear Regression
We assume that each y_i is generated from
y_i = \mathbf x_i^T \boldsymbol \beta + \epsilon_i
where \epsilon_i is an i.i.d. draw from a normal noise distribution with mean 0 and variance \sigma^2.
Using properties of the normal distribution, we can say that each y_i is a draw from a normal distribution with mean \mathbf x_i^T \boldsymbol \beta and variance \sigma^2.
Pr(y_i | \mathbf x_i , \boldsymbol \beta, \sigma^2) = \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)
\mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[-\frac{1}{2\sigma^2} (y_i - \mathbf x_i^T \boldsymbol \beta)^2 \right]
Given values of \mathbf x_i, y_i, \boldsymbol \beta, and \sigma^2, we could compute the probability/density with which we would expect to see y_i generated from this distribution!
Over a collection of feature/outcome pairs, we can easily compute the joint density with which we would expect to see all of the values if we assume that each of the draws are i.i.d. from a common data generating distribution
Pr(\mathbf y | \mathbf X , \boldsymbol \beta, \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)
We could compute this probability if we knew the parameters
However, it makes sense that we could use this specification to choose optimal values of the parameters!
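A numpy sketch of this joint log density on synthetic data, to make the "score a parameter value" idea concrete (function and variable names are my own):

```python
import numpy as np

def gaussian_loglik(y, X, beta, sigma2):
    """Joint log density of the outcomes under the Gaussian regression model."""
    resid = y - X @ beta
    N = len(y)
    return -N * np.log(np.sqrt(2 * np.pi * sigma2)) - resid @ resid / (2 * sigma2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
beta_true = np.array([1.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=50)
print(gaussian_loglik(y, X, beta_true, 0.25))
```

Parameter values closer to the truth assign higher joint density to the observed data, which is exactly the leverage the MLE exploits.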
A few notes:
For discrete outcomes, the negative log-likelihood ranges from 0 to \infty and achieves zero only when the features and parameters explain every outcome perfectly
This means that the negative log-likelihood is a proper loss function and can be used to choose an ERM!
The NLL is a proper scoring rule. See 14.2 in PML2.
Minimize the NLL to get the ERM.
Let’s show the MLE for \boldsymbol \beta for a linear regression model.
\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmax }}\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2)
Define the joint likelihood:
\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2) = \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[-\frac{1}{2\sigma^2} (y_i - \mathbf x_i^T \boldsymbol \beta^*)^2 \right]
Compute the log likelihood:
\ell\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2) = \sum \limits_{i = 1}^N - \log \sqrt{2 \pi \sigma^2} -\frac{1}{2\sigma^2} (y_i - \mathbf x_i^T \boldsymbol \beta^*)^2
Note that this is a sum of squared errors. We can re-express it in matrix form as:
\ell\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2) = - N \log \sqrt{2 \pi \sigma^2} - \frac{1}{2\sigma^2} (\mathbf y - \mathbf X \boldsymbol \beta^*)^T(\mathbf y - \mathbf X \boldsymbol \beta^*)
Taking the derivative w.r.t. \boldsymbol \beta^* and setting to \mathbf 0:
-\frac{1}{2\sigma^2}\left[-2\mathbf X^T \mathbf y + 2 \mathbf X^T \mathbf X \boldsymbol \beta^*\right] = \mathbf 0
Solving:
\hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y
Two interesting points:
The MLE for the coefficients is the same as minimizing the mean squared error
The negative log likelihood is just the MSE up to an additive constant and a positive scaling!
-\ell\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2) = C + \frac{1}{2\sigma^2} \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta^*)^2
Maximizing the likelihood of the Gaussian regression model is equivalent to minimizing the MSE.
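A quick numerical check of the closed-form MLE against a library least-squares solver (synthetic data; a minimal numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=100)

beta_mle = np.linalg.solve(X.T @ X, X.T @ y)       # the normal equations
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library least-squares solver
assert np.allclose(beta_mle, beta_lstsq)
```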
A final point:
We can express the linear regression model as a chain of equations.
-\log Pr(\mathbf y | \mathbf X, \boldsymbol \beta , \sigma^2) = -\sum \log \mathcal N(y_i | z_i , \sigma^2)
z_i = \mathbf x_i^T \boldsymbol \beta
This is useful because of the derivative chain rule:
\frac{\partial [-\log Pr(y | \mathbf x, \boldsymbol \beta , \sigma^2)]}{\partial \boldsymbol \beta} = \frac{\partial [-\log Pr(y | \mathbf x, \boldsymbol \beta , \sigma^2)]}{\partial z} \frac{\partial z}{\partial \boldsymbol \beta}
Can we do the same thing for binary logistic regression?
To make logistic regression work, we assume that each 0/1 outcome is an i.i.d. draw from a Bernoulli distribution:
Pr(y_i | \mathbf x_i , \boldsymbol \beta) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}
where
\theta_i = \frac{\exp[z_i]}{1 + \exp[z_i]}
and
z_i = \mathbf x_i^T \boldsymbol \beta
Take a minute:
Can you derive the negative log likelihood for the Bernoulli distribution in terms of \theta?
\ell(\boldsymbol \beta | \mathbf X , \mathbf y) = \prod \limits_{i = 1}^N \theta_i^{y_i} (1 - \theta_i)^{1- y_i}
The binary cross-entropy:
-\frac{1}{N} \sum \limits_{i = 1}^N y_i \log \theta_i + (1 - y_i) \log (1 - \theta_i)
is a viable measure of loss for two-class problems!
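A minimal numpy version of the binary cross-entropy (the eps clipping is a numerical guard I've added, not part of the definition):

```python
import numpy as np

def bce(y, theta, eps=1e-12):
    """Binary cross-entropy; eps guards against log(0)."""
    theta = np.clip(theta, eps, 1 - eps)
    return -np.mean(y * np.log(theta) + (1 - y) * np.log(1 - theta))

y = np.array([1, 0, 1, 1])
print(bce(y, np.array([0.9, 0.1, 0.8, 0.7])))  # confident, mostly right -> small loss
```

Confident correct predictions score below an uninformative prediction of 0.5 everywhere, which costs log 2 per instance.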
Can we find an analytical solution for the MLE coefficients?
Let’s use the chain rule:
\frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf X , \mathbf y)}{\partial \boldsymbol \theta} \frac{\partial \boldsymbol \theta}{\partial \mathbf z} \frac{\partial \mathbf z}{\partial \boldsymbol \beta}
A note:
Because our initial objective function is a sum, we can split this problem into a sum of derivatives:
\sum \limits_{i = 1}^N \frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf x_i , y_i)}{\partial \theta_i} \frac{\partial \theta_i}{\partial z_i} \frac{\partial z_i}{\partial \boldsymbol \beta}
Let’s start with the first term:
\frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf x_i , y_i)}{\partial \theta_i} = - \left(\frac{y_i}{\theta_i} - \frac{1 - y_i}{1 - \theta_i} \right)
The second term:
\frac{d}{d z_i}\frac{\exp[z_i]}{1 + \exp[z_i]}
can be solved using the logarithmic derivative identity:
f(w) \frac{d}{dw} \log f(w) = \frac{d}{dw} f(w)
Applying this:
\frac{\exp[z_i]}{1 + \exp[z_i]}\left[1 - \frac{\exp[z_i]}{1 + \exp[z_i]} \right] = \theta_i(1 - \theta_i)
The third term:
\frac{\partial \mathbf x_i^T \boldsymbol \beta}{\partial \boldsymbol \beta} = \mathbf x_i
Putting this all together:
\frac{\partial [- \ell \ell(\boldsymbol \beta | \mathbf X , \mathbf y)]}{\partial \boldsymbol \beta} = \sum \limits_{i = 1}^N - \left(\frac{y_i}{\theta_i} - \frac{1 - y_i}{1 - \theta_i} \right) \theta_i(1 - \theta_i) \mathbf x_i
which can be simplified to:
\frac{\partial [- \ell \ell(\boldsymbol \beta | \mathbf X , \mathbf y)]}{\partial \boldsymbol \beta} = \sum \limits_{i = 1}^N - (y_i - \theta_i) \mathbf x_i
Plugging in our definition of \theta_i:
\frac{\partial [- \ell \ell(\boldsymbol \beta | \mathbf X , \mathbf y)]}{\partial \boldsymbol \beta} = - \sum \limits_{i = 1}^N \left( y_i - \frac{\exp[\mathbf x_i^T \boldsymbol \beta]}{1 + \exp[\mathbf x_i^T \boldsymbol \beta]} \right) \mathbf x_i
Set this equal to zero and solve for \boldsymbol \beta!
That’s not going to work…
However, we have a gradient for the coefficients and can implement gradient descent algorithms to find the minimum!
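A sketch of gradient descent for logistic regression using the gradient we just derived (synthetic data; the step size and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200) > 0).astype(float)

beta = np.zeros(2)
lr = 0.1
for _ in range(500):
    theta = sigmoid(X @ beta)
    grad = X.T @ (theta - y) / len(y)  # the derived gradient, averaged over N
    beta -= lr * grad
print(beta)  # should roughly recover the signs of the true coefficients (2, -1)
```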
We can set multinomial logistic regression up in a similar way leveraging a categorical distribution
\log Pr(y_i | \boldsymbol \theta_i) = \sum \limits_{k = 1}^K I(y_i = k) \log \theta_k
\theta_k = \frac{\exp[z_k]}{\sum \limits_{h = 1}^K \exp[z_h]}
z_k = \mathbf x_i^T \boldsymbol \beta_k
Like logistic regression, there is a unique (with an additional constraint) MLE
We just can’t find it analytically
The gradients for the coefficients are left for your 2nd homework
Linear models are the most common machine learning method because of their ease of implementation
Find a likelihood model for the outcome that corresponds to your loss function of choice
Use the likelihood to find a set of coefficients for the P features included in \phi(\mathbf x)
It’s really that simple!
They are also so popular because they are easy to regularize
Regularization is a practice in convex optimization that converts an ill-posed optimization problem into one that is “regular” and can be solved via standard optimization methods.
It has also found a home in the machine learning community as it can be used to force models fit via ERM to find solutions that generalize well (or at least better than the true ERM solution)
We can define a regularized loss function as one that adds an additional penalty to the loss as the complexity of the model increases. With a negative log-likelihood, the penalized loss function becomes:
\text{NLL} + \lambda C(\boldsymbol \Theta)
where C(\cdot) is a function that increases as the model becomes more complex and \lambda is a tunable knob.
For linear models with continuous outcomes, we can define the complexity of the model in terms of the projection matrix that maps the training features to the outcomes.
Remember that this matrix encodes the self-correlation between the training outcome and its prediction
When we overfit, we’re relying too heavily on the training data
\text{Generalization Error} = \text{Training Error} + \frac{2}{N} \sum Cov[y_i, \hat{y}_i]
\hat{y} = \mathbf S \mathbf y
For OLS:
\mathbf S = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T
\frac{2}{N}\sum Cov[y_i, \hat{y}_i] = \frac{2\sigma^2}{N}\text{tr}(\mathbf S) = \frac{2\sigma^2P}{N}
The complexity of a linear model has something to do with the number of coefficients!
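We can check tr(\mathbf S) = P numerically (a numpy sketch with random full-rank features):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 6
X = rng.normal(size=(N, P))
S = X @ np.linalg.inv(X.T @ X) @ X.T  # the OLS projection ("hat") matrix

# The effective degrees of freedom equal the number of coefficients
assert np.isclose(np.trace(S), P)
```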
So, our loss function should try to correct for this difference by placing a penalty on the number of parameters
Account for the difference between the empirical risk and the true risk
The problem, though, revolves around how to place the penalty.
Approach 1:
Choose M \le P features to keep in the model and select the subset with the lowest generalization error. Try all possible combinations or perform a greedy search
Problems?
Rather than directly choosing features to leave in or out, regularize the search to create a path of models between the empty model and the full model estimated via OLS.
The path over models is over a proper norm of the coefficient vector
Smaller norms mean less complex models with fewer features
Larger norms mean more complex models with more features
Two common norm choices:
\text{NLL} + \lambda \|\boldsymbol \beta\|_2^2
\text{NLL} + \lambda \|\boldsymbol \beta\|_1
The effect of implementing this norm penalty is that the norm of the ERM coefficients shrinks towards zero
Both choices try to replicate the gap between training error and generalization error!
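A sketch of the shrinkage behavior with sklearn's Ridge and Lasso on synthetic data (note sklearn calls the tuning parameter alpha, and its Lasso objective scales the loss by 1/(2N)):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

for lam in [0.01, 1.0, 100.0]:
    b_l2 = Ridge(alpha=lam).fit(X, y).coef_
    b_l1 = Lasso(alpha=lam).fit(X, y).coef_
    # Report the L2 norm of the ridge fit and the number of nonzero lasso coefficients
    print(lam, np.linalg.norm(b_l2), np.sum(b_l1 != 0))
```

As the penalty grows, the L2 norm of the ridge coefficients shrinks smoothly, while the L1 penalty zeroes out coefficients entirely.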
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.404
Model: OLS Adj. R-squared: 0.365
Method: Least Squares F-statistic: 10.49
Date: Thu, 23 Jan 2025 Prob (F-statistic): 7.26e-09
Time: 13:12:25 Log-Likelihood: -364.25
No. Observations: 100 AIC: 742.5
Df Residuals: 93 BIC: 760.7
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.0391 0.958 3.172 0.002 1.136 4.942
x1 -3.5738 12.524 -0.285 0.776 -28.443 21.296
x2 -3.5332 9.878 -0.358 0.721 -23.150 16.083
x3 21.2895 12.638 1.685 0.095 -3.807 46.386
x4 3.7228 9.049 0.411 0.682 -14.247 21.693
x5 0.8834 9.869 0.090 0.929 -18.715 20.481
x6 -14.4625 9.005 -1.606 0.112 -32.345 3.420
==============================================================================
Omnibus: 1.960 Durbin-Watson: 2.160
Prob(Omnibus): 0.375 Jarque-Bera (JB): 1.502
Skew: -0.289 Prob(JB): 0.472
Kurtosis: 3.163 Cond. No. 37.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We see that when \lambda is small, we largely recover OLS
As \lambda increases, the ERM for both norm penalties shrinks
Finally, when \lambda is large, no set of coefficients yields an MSE reduction big enough to offset the complexity penalty, so they are all set to zero.
All in all, this strategy works quite well for yielding more generalizable linear models
Use cross-validation or other approaches to choose a value of \lambda given the penalty that minimizes generalization error
For the L2 penalty with MSE loss, there always exists some non-zero \lambda that yields lower generalization error than OLS
Most of the time, L1 will decrease generalization error.
Why does regularization work?
But, the penalties seem kinda arbitrary
Personally, I’ve found all non-Bayesian explanations to be quite underwhelming…
How do we actually find the coefficients for a logistic regression model?
Gradients and Hessians
Gradient Descent and Stochastic Gradient Descent
2nd order methods
Adaptive momentum methods